SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research
Semester 1, 2026
Last updated: 2026-01-22
I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.
By the end of this lecture, you will be able to:
Definition
Reproducibility means that everything you did—all of it, end-to-end—can be independently redone by someone else.
This includes:
For science:
For you:
Many published findings cannot be reproduced:
The problem
Poor documentation, lost data, and undocumented decisions make it impossible to verify or build on previous work.
“The use of projects is required to meet the minimal level of reproducibility expected of credible work.”
Definition
Quarto integrates code and natural language in a way called “literate programming”. It combines text, code, and output in a single document.
Key features:
Advantages:
For this course:
In RStudio:
Two editing modes
Start with Source to learn the syntax!
A Quarto document has three parts:
- YAML header (between the --- lines)
- Markdown text
- Code chunks
## YAML header basics
The YAML header controls document settings:
```yaml
---
title: "My document"
author: "Rohan Alexander"
date: today
format: html
---
```
Common options:
- title, author, date: Document metadata
- format: html or format: pdf: Output type
- abstract: Summary text

Include a bibliography file in the YAML:
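
A minimal sketch of a header with a bibliography entry. The file name references.bib is an assumption, borrowed from the example project layout used later in this lecture:

```yaml
---
title: "My document"
format: html
bibliography: references.bib  # assumed file name; the path is relative to the .qmd file
---
```
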
Then cite in your text:
- @citeR produces: R Core Team (2023)
- [@citeR] produces: (R Core Team 2023)

Finding citations
Use Google Scholar or doi2bib to get BibTeX entries.
Emphasis:
- *italic* → italic
- **bold** → bold

Headers:
# First level
## Second level
### Third level
Lists:
- Item 1
- Item 2
    + Sub-item
Links:
[text](https://url.com)
Code chunks contain R code that will execute:
Control how chunks behave with special comments:
| Option | Effect |
|---|---|
| echo: false | Hide the code, show output |
| eval: false | Show code, don’t run it |
| include: false | Run code, hide everything |
| warning: false | Suppress warnings |
| message: false | Suppress messages |
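
As a rough sketch, chunk options are written as `#|` comments at the top of a chunk. Inside a .qmd file the lines below would sit within an `{r}` code chunk. The label, caption, and data values are illustrative assumptions, not the lecture's original example:

```r
#| label: fig-scatter
#| fig-cap: "A simple scatterplot"
#| echo: false
#| warning: false

# Illustrative data (assumed values)
hours <- c(1, 2, 3, 4, 5)
score <- c(52, 58, 65, 71, 80)

# Base R scatterplot; with echo: false only the figure appears in the output
plot(hours, score, xlab = "Hours studied", ylab = "Score")
```

Because the label starts with fig-, the figure can be referenced in the text as @fig-scatter.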
Figure 1: A simple scatterplot
Reference figures and tables by their label:
- @fig-labelname → Figure 1
- @tbl-labelname → Table 1
- @eq-labelname → Equation 1

Naming convention
Labels must start with the type prefix:
- fig- for figures
- tbl- for tables
- eq- for equations

Click the Render button (or press Ctrl/Cmd + Shift + K)
This will:
Common error
The Quarto document must be self-contained. Objects in your R environment are not automatically available—you must load data within the document.
The problem with setwd()
Using setwd("C:/Users/yourname/Documents/project/") means:
R Projects solve this by making all file paths relative to the project folder.
An R Project is a folder with a special .Rproj file that tells RStudio:
“The use of R Projects enables ‘reliable, polite behavior across different computers or users and over time’.”
In RStudio:
Good project names
- australian_elections_2022

my_project/
├── my_project.Rproj
├── README.md
├── inputs/
│   ├── data/
│   │   └── raw_data.csv
│   └── literature/
├── outputs/
│   ├── data/
│   │   └── cleaned_data.csv
│   └── paper/
│       ├── paper.qmd
│       └── references.bib
└── scripts/
    ├── 00-simulate_data.R
    ├── 01-download_data.R
    ├── 02-clean_data.R
    └── 03-test_data.R
inputs/
outputs/
scripts/
README.md
Every project needs a README that explains:
Template
Use the Social Science Data Editors template as a starting point.
Your working directory is where R looks for files by default.
Don’t use setwd()!
With R Projects, the working directory is automatically set to the project folder. Using setwd() breaks reproducibility.
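
A minimal sketch of working with relative paths instead of setwd(). The folder and file names follow the example project layout and are assumptions about your own project:

```r
# Print the current working directory; when the project is opened through
# its .Rproj file, this is the project folder
getwd()

# Build a path relative to the project folder
# (names follow the example layout; adjust to your own project)
raw_path <- file.path("inputs", "data", "raw_data.csv")
raw_path

# Check that the file is there before trying to read it
file.exists(raw_path)
```

The same relative path works on any computer that has a copy of the project folder, which is the point of avoiding absolute paths.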
CSV (Comma-Separated Values) is the most common data format:
read_csv() vs read.csv()
read_csv() from the tidyverse is faster and handles data types better. We’ll use it throughout this course.
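
A minimal sketch of reading a CSV file with read_csv(). So that it runs anywhere, it first writes a tiny file to a temporary location; the file contents and the object name survey are illustrative assumptions:

```r
library(readr)

# Write a tiny CSV to a temporary file so the example runs anywhere
# (in a real project the file would live under inputs/data/)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("name,age,score", "Alice,25,85.5", "Bob,30,92", "Carol,35,78.5"),
  csv_path
)

# Read the file into a tibble; column types are guessed from the data
survey <- read_csv(csv_path, show_col_types = FALSE)
survey
```
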
The header argument
Many data files have column names in the first row:
Tidyverse default
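read_csv() treats the first row as column names by default (col_names = TRUE), so there is one less thing to remember. The header argument itself belongs to base R's read.csv().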
Save your processed data:
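
A minimal sketch using write_csv(). The data values are illustrative assumptions, and a temporary file stands in for a project path such as outputs/data/cleaned_data.csv:

```r
library(readr)

# Illustrative cleaned data (assumed values)
cleaned <- data.frame(
  name  = c("Alice", "Bob"),
  score = c(85.5, 92)
)

# Save to a temporary file here; in a project this might be
# "outputs/data/cleaned_data.csv"
out_path <- tempfile(fileext = ".csv")
write_csv(cleaned, out_path)

# Read the file back to confirm the round trip
read_csv(out_path, show_col_types = FALSE)
```
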
After loading data, always check it:
# A tibble: 3 × 3
  name    age score
  <chr> <dbl> <dbl>
1 Alice    25  85.5
2 Bob      30  92  
3 Carol    35  78.5
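
A minimal sketch of some common checks, recreating the three-row example above by hand. The object name dat and the particular choice of checks are assumptions:

```r
library(tibble)

# Recreate the example data shown above
dat <- tibble(
  name  = c("Alice", "Bob", "Carol"),
  age   = c(25, 30, 35),
  score = c(85.5, 92, 78.5)
)

head(dat)     # first rows
dim(dat)      # number of rows and columns
names(dat)    # column names
summary(dat)  # quick summary of each column
```
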
Two ways to access a column:
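
A minimal sketch, assuming the two forms meant here are `$` and `[[ ]]`. The data frame dat is an illustrative stand-in:

```r
# Illustrative data (assumed values)
dat <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  age  = c(25, 30, 35)
)

# Dollar sign: column name written directly
dat$age

# Double brackets: column name given as a string
dat[["age"]]

# Both return the same vector
identical(dat$age, dat[["age"]])
```
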
A vector is an ordered collection of items of the same type:
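
A minimal sketch of creating vectors with c(). The values and object names are illustrative assumptions:

```r
# A numeric vector and a character vector
ages <- c(25, 30, 35)
first_names <- c("Alice", "Bob", "Carol")

class(ages)         # "numeric"
class(first_names)  # "character"

# Mixing types forces every item to a single common type
mixed <- c(1, "two", 3)
class(mixed)        # "character"
```
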
Mathematical operations work element-by-element:
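
A minimal sketch with illustrative values (assumed, not from the lecture):

```r
scores <- c(85.5, 92, 78.5)

# Each operation is applied to every element in turn
scores + 5    # 90.5 97.0 83.5
scores / 100  # 0.855 0.920 0.785

# Two vectors of the same length are combined position by position
bonus <- c(1, 2, 3)
scores + bonus  # 86.5 94.0 81.5
```
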
R represents missing data as NA:
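
A minimal sketch of how NA behaves, using illustrative values (assumed):

```r
ages <- c(25, NA, 35)

# Find which elements are missing
is.na(ages)               # FALSE TRUE FALSE

# Most summaries return NA when any value is missing
mean(ages)                # NA

# na.rm = TRUE drops missing values before calculating
mean(ages, na.rm = TRUE)  # 30
```
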
Access specific elements with [ ]:
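
A minimal sketch of indexing, using illustrative values (assumed):

```r
scores <- c(85.5, 92, 78.5)

scores[1]            # 85.5 (R counts from 1, not 0)
scores[c(1, 3)]      # 85.5 78.5
scores[-2]           # 85.5 78.5 (a negative index drops that element)
scores[scores > 80]  # 85.5 92.0 (keep elements that meet a condition)
```
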
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
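
Comparisons return logical values such as the TRUE results shown above. A minimal sketch with an illustrative value; the specific comparisons are assumptions, not the lecture's original code:

```r
x <- 5   # assignment with <-

x == 5   # TRUE: is x equal to 5?
x != 3   # TRUE: is x different from 3?
x > 3    # TRUE: is x greater than 3?
x <= 5   # TRUE: is x less than or equal to 5?
```
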
Common mistake
= is assignment, == is comparison!
- x = 5 assigns the value 5 to x
- x == 5 asks “is x equal to 5?”

The old way:
- analysis.R
- analysis_v2.R
- analysis_final.R
- analysis_final_FINAL.R
- analysis_final_FINAL_v2.R

The Git way:
- analysis.R

Definition
Git is a version control system that tracks changes to files over time. It keeps snapshots of your project that you can return to.
Key concepts:
Definition
GitHub is a website that hosts Git repositories online. It makes it easy to share code, collaborate, and back up your work.
Think of it as:
Check if Git is installed (in Terminal):
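
In the Terminal, the usual check is `git --version`. From the R console, a rough equivalent can be sketched with base R's Sys.which(); this is an alternative check, not necessarily the one shown in the lecture:

```r
# Look for the git executable on the system PATH
git_path <- Sys.which("git")

if (nzchar(git_path)) {
  message("Git found at: ", git_path)
} else {
  message("Git not found on the PATH")
}
```
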
Username advice
Your username becomes part of your professional profile. Choose something:
RStudio has a Git pane that makes this easier:
Start with Pull
Always pull before you start working to get any changes.
A commit message should explain what changed and why:
Add graphs to data section
- Added unemployment line chart
- Added inflation bar chart
- Updated figure references in text
Commit regularly
Frequent, small commits are better than rare, large ones. They make it easier to find and fix problems.
The .gitignore file
Some files should not be tracked:
- System files (.DS_Store)

List these in a .gitignore file:
# Ignore data files
*.csv
data/
# Ignore system files
.DS_Store
Keep your PAT secret
Never include your personal access token (PAT) in any R script or document!
Option 1: Create on GitHub first
Git can be confusing
“It is normal to be intimidated by Git and GitHub. Many data scientists only know a little about how to use it, and that is okay.”
The key commands are: pull, commit, push.
Everything else can be learned as needed!
Setup (once)
Each session
assignment_1/
├── assignment_1.Rproj
├── README.md
├── .gitignore
├── inputs/
│   └── data/
│       └── raw_survey.csv
├── outputs/
│   ├── data/
│   │   └── cleaned_survey.csv
│   └── paper/
│       ├── assignment_1.qmd
│       └── references.bib
└── scripts/
    ├── 01-download_data.R
    └── 02-clean_data.R
✅ Used an R Project (no setwd())
✅ All data loaded from files (not environment)
✅ All packages loaded at the start
✅ Code runs from top to bottom
✅ Results documented in Quarto
✅ Project tracked with Git/GitHub
✅ README explains how to run the code
Telling Stories with Data:
Regression and Other Stories:
- read_csv() and write_csv() handle data files

Week 3: Data Acquisition and Measurement
Before next week
Office hours:
Email: